Faster Joins, Self-Joins and Multi-Way Joins Using Join Indices
Abstract
We propose a new algorithm, called Stripe-join, for performing a join given a join index. Stripe-join is inspired by an algorithm called "Jive-join" developed by Li and Ross. Stripe-join makes a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a set of temporary files that contain tuple identifiers but no input tuples. Stripe-join performs this efficiently even when the input relations are much larger than main memory, as long as the number of blocks in main memory is of the order of the square root of the number of blocks in the participating relations. Stripe-join is particularly efficient for self-joins. To our knowledge, Stripe-join is the first algorithm that, given a join index and a relation significantly larger than main memory, can perform a self-join with just a single pass over the input relation and without storing input tuples in intermediate files. Almost all the I/O is sequential, thus minimizing the impact of seek and rotational latency. The algorithm is resistant to data skew. It can also join multiple relations while still making only a single pass over each input relation. Using a detailed cost model, Stripe-join is analyzed and compared with competing algorithms. For large input relations, Stripe-join performs significantly better than Valduriez's algorithm and hash join algorithms. We demonstrate circumstances under which Stripe-join performs significantly better than Jive-join. Unlike Jive-join, Stripe-join makes no assumptions about the order of the join index.
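For readers unfamiliar with the term, a join index is commonly understood as a precomputed list of pairs of tuple identifiers, one pair per matching combination of tuples. The sketch below only illustrates that idea with a naive in-memory lookup. It is not Stripe-join: the abstract does not spell out the algorithm's passes or its temporary files, and this version uses random access rather than sequential I/O. All names and data (`join_with_index`, `employees`, `departments`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Tuple  # a tuple of attribute values; its position in the list serves as its identifier


def join_with_index(
    left: Sequence[Row],
    right: Sequence[Row],
    join_index: Sequence[Tuple[int, int]],
) -> List[Row]:
    """Naively materialize a join from (left_id, right_id) pairs.

    Illustrative only: every pair triggers a direct lookup in both inputs,
    which is the random-access pattern a disk-based algorithm tries to avoid.
    """
    return [left[i] + right[j] for i, j in join_index]


# Hypothetical relations: (name, dept_id) and (dept_id, dept_name).
employees = [("alice", 10), ("bob", 20), ("carol", 10)]
departments = [(10, "db"), (20, "os")]

# Hypothetical join index on dept_id, assumed to have been computed earlier.
emp_dept_index = [(0, 0), (1, 1), (2, 0)]
result = join_with_index(employees, departments, emp_dept_index)

# A self-join uses the same relation on both sides of each pair.
same_dept_index = [(0, 2)]  # alice and carol share dept_id 10
self_result = join_with_index(employees, employees, same_dept_index)
```

In this toy version the join index does all the matching work up front, so the join itself reduces to fetching and concatenating tuples. The difficulty the abstract addresses is doing that fetching efficiently when the relations are much larger than main memory.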
Similar resources
Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams
We study sliding window multi-join processing in continuous queries over data streams. Several algorithms are reported for performing continuous, incremental joins, under the assumption that all the sliding windows fit in main memory. The algorithms include multiway incremental nested loop joins (NLJs) and multi-way incremental hash joins. We also propose join ordering heuristics to minimize th...
Faster Joins, Self-Joins and Multi-Way Joins Using Join Indices
We propose a new algorithm, called Stripe-join, for performing a join given a join index. Stripe-join is inspired by an algorithm called "Jive-join" developed by Li and Ross. Stripe-join makes a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a set of temporary files that contain tuple identifiers but no input tuples. Stripe-join pe...
Memory-Efficient Hash Joins
We present new hash tables for joins, and a hash join based on them, that consumes far less memory and is usually faster than recently published in-memory joins. Our hash join is not restricted to outer tables that fit wholly in memory. Key to this hash join is a new concise hash table (CHT), a linear probing hash table that has 100% fill factor, and uses a sparse bitmap with embedded populatio...
Selecting the Most Suitable Query Language for Using Outer Joins to Extract Data in Datalog Mode in the DES Deductive Database System
Deductive database systems are designed around a logical data model. In a deductive database system, data are stored as facts, whereas a relational database management system (RDBMS) stores data in tables. Datalog Educational System (DES) is a deductive database system whose default mode is Datalog. It can extract data using outer joins with three query la...
Efficient Skew Handling for Outer Joins in a Cloud Computing Environment
Outer joins are ubiquitous in many workloads and Big Data systems. The question of how to best execute outer joins in large parallel systems is particularly challenging, as real world datasets are characterized by data skew leading to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding prob...
H2RDF+: High-performance distributed joins over large-scale RDF graphs
The proliferation of data in RDF format calls for efficient and scalable solutions for their management. While scalability in the era of big data is a hard requirement, modern systems fail to adapt based on the complexity of the query. Current approaches do not scale well when faced with substantially complex, non-selective joins, resulting in exponential growth of execution times. In this work...